***Identification of CDK7 phosphorylation targets with SILAC phosphoproteomics and Syros covalent CDK7 inhibitor SY-351 The data is from Orbitrap Fusion mass spec analysis of two channel (light:heavy) SILAC phosphoproteome from SY-351 treated HL-60 cells.

This is the second of five R markdown (Rmd) files used to generate the vignettes in the SY351SILAC data package:

  1. parse_process_phosphodata_SY351_1hr_HL60_PhosphoSILAC_ZPoss_WOld.Rmd
  2. analyze_phospho_data_SY351_1hr_HL60_PhosphoSILAC_ZPoss_WOld.Rmd
  3. make_site_table_SY351_1hr_HL60_PhosphoSILAC_ZPoss_WOld.Rmd
  4. prepare_analyze_momo_motif_analysis.Rmd
  5. parse_analyze_proteomedata_SY351_1hr_HL60_PhosphoSILAC.Rmd

This vignette is part of the SY351SILAC package and documents the differential statistical analysis (empirical Bayes) of the phosphosites data frame to identify sites that change upon treatment of HL60 cells with the covalent CDK7 inhibitor. This generates the psitestats data frame. (the data frame is a tbl_df or tibble - see dplyr package and tidyverse in general for info)

In a nutshell, the biggest improvement in statistical power comes from incorporating into the model two coefficients, one for the log2 ratio of SY-351/DMSO and a coefficient corresponding to a systematic label effect, a.k.a. isotope effect, analogous to a dye swap design in 2-color microarrays and limma’s nomenclature. For this to work, ensure that the sign of ratios for the label-swap sample has not been manually inverted, for example in Persues. Otherwise the linear equations will not work. The other improvement came from filtering highly extreme null effect log2 ratios. See html document for details on threshold values – I didn’t fiddle with these very much and just picked one and went with it. There is a trade-off here with modeling the isotope effect. You give up one degree of freedom to estimate this coefficient, at the expense of degrees of freedom that would be used to estimate your residual variance. If the isotope effect wasn’t a major effect, you would observe a degradation in performance.

library(tidyverse)
library(limma)
library(lattice)
library(grid)
library(gridExtra)
library("ggpubr")
library( GGally)
library(filenamer)
options(stringsAsFactors = FALSE)
options(warn=-1)
# define functions
# 
source( DataPackageR::project_path('data-raw/fdrfunctions.R'))
source( DataPackageR::project_path('data-raw/silac_model_matrix.R'))
source( DataPackageR::project_path('data-raw/my.panel.cor.R'))
source( DataPackageR::project_path('data-raw/ggvolcano.R'))
source( DataPackageR::project_path('data-raw/get_toptable.R'))
source( DataPackageR::project_path('data-raw/extract.first.id.R'))
# get_toptable
# get.toptable <- function(fitobj, coef, glist, suffix) {
#   require(limma)
#   require(dplyr)
#   ttemp <- limma::topTable(fitobj,coef = coef,genelist = glist,number = Inf,p.value = 1)
#   ttemp <- dplyr::as_tibble(ttemp) %>% dplyr::rename( ID = Unique.identifier)
#   datinds <-  which(names(ttemp) != "ID")
#   names(ttemp)[datinds] <- paste(names(ttemp)[datinds], suffix, sep="")
#   return(ttemp)
# }

# When running this example as a standalone Rmd file, these lines will need to be changed to 
# import the phosphosites and targets data frame from elsewhere. These lines are used by
# DataPackageR to generate vigenettes from this Rmd file during data package generation 
# for the SY351SILAC data package.
phosphosites <- DataPackageR::datapackager_object_read("phosphosites")
targets <- DataPackageR::datapackager_object_read("targets")

Set up important parameters

# any psites with SILAC ratios in the null experiment (no treatment in either channel) will be removed prior to linear modeling
abs_null_log2ratio_thresh <- 0.5 
# Any psites are removed for which there are numvalid_trt_thresh or less valid ratios 
# (counting only the samples containing a treatment channel)  
numvalid_trt_thresh <- 2

# False discovery rate threshold for differential expression with SY-351 treatment
fdr.thresh <- 0.05

phseqlen <- 13 # Intended length of final phosphosequence window to submit to MoMo

Targets data frame

The target matrix maps sample names with sample SILAC ratios (column names in PhosphoSTY.txt file) and which samples were labeled with what SILAC label. Make sure the the target data frame has two columns: Light and Heavy which indicates the sample’s SILAC label. targets. e.g. for a label-swap design.

targets

Plot densities of phosphosite SILAC ratios in each L:H SILAC plex .

These distributions should be centered on a log2 ratio of zero

limma::plotDensities(phosphosites[,targets$samples])

Tabulate the number of valid values for each phosphosite

table(phosphosites$numvalid)
## 
##     2     3     4 
##  6762  6163 16614
# 2     3     4 
# 3295  4632 16614 

table(phosphosites$numvalid_trt)
## 
##     1     2     3 
##  3669  8021 17849

Normalize SILAC phosphosite ratios using limma’s cyclic loess algorithm

# Copy ratio columns to a new set of columns representing unnormalized ratios (*_nonorm)
nonorm <- function(x) {x}

phosphosites <- phosphosites %>% mutate_at(vars(targets$samples), .funs = list(nonorm = ~nonorm(.) )) 

phosphosites[,targets$samples]  <- limma::normalizeBetweenArrays(phosphosites[,targets$samples],method = "cyclicloess")

Set up model matrices to compare different models

The first design matrix “design” has just one coefficient (1st and only column) that captures the log2(SY351/DMSO) effect. The next design matrix “design.isotopeffect” The first coeficient (first column) is the systematic “isotope effect”, which in this context is the systematic difference in the Heavy vs. Light effect superimposed on all the other treatments, modeled as an intercept coefficient. The second coefficient (column 2) is the log2(SY351/DMSO) effect. This alternative design has improved statistical power due to explicit modeling of the isotope labeling effect. We hypothesize that this effect, which we have observed in independent experiments with multiple cell lines, is due to protein expression changes in the independent cell populations that diverge during the amino acid labeling.

design <- silac_model_matrix(targets, ref = "DMSO")
## Found unique target names:
##  DMSO SY351
design
##      SY351
## [1,]     0
## [2,]     1
## [3,]     1
## [4,]    -1
design.isotopeffect <- cbind(rep(1,length(targets$samples)), design)
colnames(design.isotopeffect) <- c("isotope","SY351")
design.isotopeffect
##      isotope SY351
## [1,]       1     0
## [2,]       1     1
## [3,]       1     1
## [4,]       1    -1
# In case NA's are not filtered out in source file from Perseus, this condition is used to only 
# use rows that have at least 2 valid log2 ratios
selectedrows <- phosphosites$numvalid_trt > numvalid_trt_thresh

# Assign a variable to index those rows with at least 2 valid values and null log2ratios within a range
# defined by the  abs_null_log2ratio_thresh parameter

selectedrows.nullfilt <- selectedrows &  
  !is.na(phosphosites$null) & 
  abs(phosphosites$null) < abs_null_log2ratio_thresh

sum(selectedrows.nullfilt)
## [1] 15061
#[1] 19012

Perform linear modeling with limma’s lmFit.

Here we set up all the fitted model objects and eBayes fitting, with the goal to compare the different models as specified in the two different design matrices, design and design.isotopeffect. Each of these is tested in combination with a row filter that tests the exclusion of phosphosites with large isotope effects as seen by extreme SILAC ratios in the null condition, in which both channels are vehicle treated.

syrlm <- lmFit(phosphosites[selectedrows,targets$samples],design)
syrlm$Amean <- log10(phosphosites$Intensity[selectedrows])

syrlm.nullfilt <- lmFit(phosphosites[selectedrows.nullfilt,targets$samples],design)
syrlm.nullfilt$Amean <- log10(phosphosites$Intensity[selectedrows.nullfilt])

syrlm.isotopeffect <- lmFit(phosphosites[selectedrows,targets$samples],design.isotopeffect)
syrlm.isotopeffect$Amean <- log10(phosphosites$Intensity[selectedrows])

syrlm.nullfiltisotope <- lmFit(phosphosites[selectedrows.nullfilt,targets$samples],design.isotopeffect)
syrlm.nullfiltisotope$Amean <- log10(phosphosites$Intensity[selectedrows.nullfilt])

fit.simp <- eBayes(syrlm, trend = T)
fit.notrend <- eBayes(syrlm, trend = F)
fit2.isotopeffect <- eBayes(syrlm.isotopeffect, trend = T)
fit2.nullfilt <- eBayes(syrlm.nullfilt, trend = T)
fit2.nullfiltisotope <- eBayes(syrlm.nullfiltisotope, trend = T)

Compare models with Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC)

The output here indicates which is the best fitting model (simple vs. two-coefficient model with isotope effect. This uses limma’s selectModel() function, which chooses, for each probe, the best fitting model out of a set of alternative models represented by a list of design matrices. Selection is by Akaike’s Information Criterion (AIC) and Bayesian Information Criterion (BIC). Best performance seen with the model that incorporates the isotope effect.

designlist <- list(simple = design, isoeffect = design.isotopeffect)
datadf <- phosphosites[selectedrows,targets$samples]
datadfvalids <- apply(datadf, 1, function(x) {sum(!is.na(x))} )
table(datadfvalids)
## datadfvalids
##     3     4 
##  1235 16614
out.aic <- selectModel(datadf[datadfvalids == 4,], designlist, criterion = 'aic')
out.bic <- selectModel(datadf[datadfvalids == 4,], designlist, criterion = 'bic')

table(out.aic$pref)
## 
##    simple isoeffect 
##      5304     11310
table(out.bic$pref)
## 
##    simple isoeffect 
##      4335     12279
barplot(table(out.aic$pref), main = 'Akaike Information Criterion (AIC)', 
        ylab = '# sites with best score')
barplot(table(out.bic$pref), main = 'Bayesian Information Criterion (BIC)', 
        ylab = '# sites with best score')

Generate the tables of phosphosites with differential expression statistics for each model.

Return tables of phosphosites with moderated t-test statistics from empirical Bayes analysis. The first coefficient in the isotope effect models is an intercept, which represents the isotope effect. The second coefficient captures the log2(SY351/DMSO) effect. If you want to see the isotope label effect, you could call the function with coef = 1 in those cases.

meta.names <- c("Unique.identifier",targets$samples,'uniprot.acc.leading', 
                'Gene.name',"psite", "gene.psite", "Sequence.window", "proteinclass",
                'Position',"Amino.acid", "PhosphoSitePlus.kinase","Protein.name", 'Gene.names')

#meta.names[!meta.names %in% names(phosphosites)]

sy351.nullfiltisotope <- get_toptable(fit2.nullfiltisotope, coef = 2, 
                                      glist = phosphosites[selectedrows.nullfilt,meta.names[1]], suffix = ".nullfiltisotope")

sy351.isotopeffect <- get_toptable(fit2.isotopeffect, coef = 2, 
                                   glist = phosphosites[selectedrows,meta.names[1]], suffix = ".isotopeffect")

sy351.isotopeffect.coef1 <- get_toptable(fit2.isotopeffect, coef = 1, 
                                         glist = phosphosites[selectedrows,meta.names[1]], suffix = ".Coef1isotopeffect")

sy351.simp <- get_toptable(fit.simp, coef = 1, 
                           glist = phosphosites[selectedrows,meta.names[1]], suffix = ".simp")

sy351.notrend <- get_toptable(fit.notrend, coef = 1, 
                              glist = phosphosites[selectedrows,meta.names[1]], suffix = ".notrend")

sy351.nullfilt <- get_toptable(fit2.nullfilt, coef = 1, 
                               glist = phosphosites[selectedrows.nullfilt,meta.names[1]], suffix = ".nullfilt")
sum(sy351.simp$adj.P.Val.simp < fdr.thresh,na.rm = T)
## [1] 308
sum(sy351.notrend$adj.P.Val.notrend < fdr.thresh,na.rm = T)
## [1] 246
sum(sy351.isotopeffect$adj.P.Val.isotopeffect < fdr.thresh,na.rm = T)
## [1] 1210
sum(sy351.isotopeffect.coef1$adj.P.Val.Coef1isotopeffect < fdr.thresh,na.rm = T)
## [1] 2440
sum(sy351.isotopeffect.coef1$adj.P.Val.Coef1isotopeffect < fdr.thresh & 
      abs(sy351.isotopeffect.coef1$logFC.Coef1isotopeffect) > abs_null_log2ratio_thresh,na.rm = T)
## [1] 668
sum(sy351.nullfilt$adj.P.Val.nullfilt < fdr.thresh,na.rm = T)
## [1] 584
sum(sy351.nullfiltisotope$adj.P.Val.nullfiltisotope < fdr.thresh,na.rm = T)
## [1] 1435
sum(sy351.nullfiltisotope$adj.P.Val.nullfiltisotope < fdr.thresh &
        abs(sy351.nullfiltisotope$logFC.nullfiltisotope) > abs_null_log2ratio_thresh,
    na.rm  =  T)
## [1] 608
# [1] 308
# [1] 246
# [1] 1210
# [1] 2440
# [1] 668
# [1] 584
# [1] 1435
# [1] 608

Join all tables together

Join all these together to make the venn diagram analysis easier and for a final output table. Start with the best model as the first table to join

psitestats <- sy351.nullfiltisotope %>%
    dplyr::full_join(sy351.isotopeffect, by = 'ID') %>%
    dplyr::full_join(sy351.nullfilt, by = 'ID') %>%
    dplyr::full_join(sy351.simp, by = 'ID') %>%
    dplyr::full_join(sy351.notrend, by = 'ID') %>% 
  dplyr::as_tibble()


logfcnms <- grep("logFC",names(psitestats), value = T)
pvalnms <-  grep("^P.Val",names(psitestats), value = T)
adjpvalnms <-  grep("^adj.P.Val",names(psitestats), value = T)

psitestats <- psitestats[, names(psitestats) %in% c("ID",logfcnms,adjpvalnms, pvalnms)]
psitestats <- psitestats %>% left_join(phosphosites[selectedrows, meta.names], by = c("ID" = "Unique.identifier"))

syrdatnms <- c("ID", "logFC.nullfiltisotope", "P.Value.nullfiltisotope", "adj.P.Val.nullfiltisotope", 
               "null", "rep1", "rep2", "rep3LF", "Gene.name", "Gene.names",  "proteinclass",
               "psite", "gene.psite", "Sequence.window", "Position", "Amino.acid", "PhosphoSitePlus.kinase", 
               "uniprot.acc.leading",  "Protein.name",
               "logFC.isotopeffect", "P.Value.isotopeffect", "adj.P.Val.isotopeffect", 
               "logFC.nullfilt", "P.Value.nullfilt", "adj.P.Val.nullfilt", "logFC.simp", 
               "P.Value.simp", "adj.P.Val.simp", "logFC.notrend", "P.Value.notrend", 
               "adj.P.Val.notrend")

syrdatnms[!syrdatnms %in% names(psitestats)]
## character(0)
names(psitestats)[names(psitestats) %in% syrdatnms]
##  [1] "ID"                        "logFC.nullfiltisotope"     "P.Value.nullfiltisotope"   "adj.P.Val.nullfiltisotope" "logFC.isotopeffect"       
##  [6] "P.Value.isotopeffect"      "adj.P.Val.isotopeffect"    "logFC.nullfilt"            "P.Value.nullfilt"          "adj.P.Val.nullfilt"       
## [11] "logFC.simp"                "P.Value.simp"              "adj.P.Val.simp"            "logFC.notrend"             "P.Value.notrend"          
## [16] "adj.P.Val.notrend"         "null"                      "rep1"                      "rep2"                      "rep3LF"                   
## [21] "uniprot.acc.leading"       "Gene.name"                 "psite"                     "gene.psite"                "Sequence.window"          
## [26] "proteinclass"              "Position"                  "Amino.acid"                "PhosphoSitePlus.kinase"    "Protein.name"             
## [31] "Gene.names"
syrdatnms <- syrdatnms[syrdatnms %in% names(psitestats)]

psitestats <- psitestats %>% dplyr::select(all_of(syrdatnms))

psitestats[,pvalnms][is.na(psitestats[,pvalnms])] <- 1
psitestats[,adjpvalnms][is.na(psitestats[,adjpvalnms])] <- 1

# extract.first.id is defined in the fdrfunctions.R file

#psitestats <- psitestats %>% tibble::add_column(Gene.name = extract.first.id(psitestats$Gene.names))

#psitestats$gene.psite <- paste(psitestats$Gene.name, psitestats$psite, sep=".")

psitestats
kinmults <- strsplit( psitestats$PhosphoSitePlus.kinase, ";")
#table(sapply(kinmults, length))

abrevkinases <- function(kins) {
    
    if(length(kins) == 0) {
      kinr <- ""  
    } else if (length(kins) == 1) {
        kinr <-  paste("<-", kins[1],sep = '', collapse = "") 
    } else if (length(kins) == 2) {
        kinr <- paste("<-", kins[1], ";", kins[2],sep = '', collapse = '')  
    } else if (length(kins) > 2) {
        kinr <- paste(c("<-", kins[1], "...", kins[length(kins)]), sep = '',collapse = '')
    }
}    

kinabbrev <- lapply(kinmults, abrevkinases)

psitestats <- psitestats %>% tibble::add_column( kinabbrev = unlist(kinabbrev))

psitestats <- psitestats %>% tibble::add_column( gene.psite.kinase =  
                                                   paste(psitestats$gene.psite, psitestats$kinabbrev, sep=""))

psitestats$gene.psite.kinase[psitestats$kinabbrev == ''] <-  psitestats$gene.psite[psitestats$kinabbrev == ''] 

psitestats <- psitestats %>% left_join(phosphosites %>% 
                               select(Unique.identifier, Multiplicity:Intensity.H), 
                             by = c("ID" = "Unique.identifier"))


count.unique <- function(x) {
  length(unique(x))
}

columnuniques <- lapply(psitestats, count.unique)

#lapply(psitestats[columnuniques == 1], function(x) unique(x))

psitestats <- psitestats[columnuniques > 1]

psitestats

Venn diagram analysis to compare different models

Shown below is a venn diagram of significant sites that overlap in three different models. The best is the “dye effect+null filtered” model, at 1435 sites with qval < 0.05. The dye effect alone does not use any filtering of extreme null ratios, and the null filtered method filters extreme null ratios but does not model a dye effect. The simple method that does not do either and only estimates the log2 (SY-351/DMSO) effect is not included here, as it results in very few (<300) sites that pass threshold, due to the un-modeled systematic error from the SILAC label effect. The trend = T parameter to eBayes() uses mean intensity of the SILAC triplet to condition the eBayes significance estimates and trend=T improves for all models, so was only tested with the simple model. See html file for more details.

vsyr <- as.matrix(psitestats[,grep("adj.P.Val",names(psitestats), value = T)])

vsyrsig <- (vsyr < fdr.thresh ) + 0

vsyrsig[is.na(vsyrsig)] <- 0

apply(vsyrsig, 2, table)
##   adj.P.Val.nullfiltisotope adj.P.Val.isotopeffect adj.P.Val.nullfilt adj.P.Val.simp adj.P.Val.notrend
## 0                     16414                  16639              17265          17541             17603
## 1                      1435                   1210                584            308               246
selind <- vsyrsig[,2] == 0 & vsyrsig[,1] == 1 & vsyrsig[,3] == 0

selind2 <- vsyrsig[,2] == 1 & vsyrsig[,1] == 0 

cols2venn <- c(1:3)

#+ fig.width=8, fig.height=8
vennDiagram(vennCounts(vsyrsig[,cols2venn]), main = paste("qval < ", fdr.thresh, sep = ""),
            names = c("Isotope effect &\nnull filtered","Isotope\neffect only",  "Null filtered only"))

apply(vsyrsig, 2, table)
##   adj.P.Val.nullfiltisotope adj.P.Val.isotopeffect adj.P.Val.nullfilt adj.P.Val.simp adj.P.Val.notrend
## 0                     16414                  16639              17265          17541             17603
## 1                      1435                   1210                584            308               246
vennCounts(vsyrsig[,cols2venn])
##   adj.P.Val.nullfiltisotope adj.P.Val.isotopeffect adj.P.Val.nullfilt Counts
## 1                         0                      0                  0  16249
## 2                         0                      0                  1      0
## 3                         0                      1                  0    165
## 4                         0                      1                  1      0
## 5                         1                      0                  0    378
## 6                         1                      0                  1     12
## 7                         1                      1                  0    473
## 8                         1                      1                  1    572
## attr(,"class")
## [1] "VennCounts"
vennpdffilename <-  as.character(filenamer::filename('Syros_venn_modelcompare_overlap_minvalid3', ext = 'pdf', subdir = F))
filenamer::make_path(vennpdffilename)

# Uncomment to make a pdf plot
pdf(file = vennpdffilename, width = 8, height = 8)
limma::vennDiagram(vennCounts(vsyrsig[,cols2venn]), main = paste("qval < ", fdr.thresh, sep = ""),
             names = c("Label effect &\nnull filtered","Label effect only",  "null filtered only"))
 
dev.off()
## png 
##   2

Volcano plot of differentially expressed phosphorylation sites in response to SY-351 treatment in HL-60 cells

psitestats <- psitestats %>% dplyr::mutate(logp = -log10(P.Value.nullfiltisotope))

mygglist <- ggvolcano(psitestats, xvar =  logFC.nullfiltisotope, 
                      yvar = logp, 
                      fdr.range.cut = c(0,0.01,1),
                      adjpvalvar = adj.P.Val.nullfiltisotope,
                      labelvar = gene.psite, 
                      label.fdr.thresh = 0.0025,
                      high.sig.threshold = 0.001, 
                      xlimits = c(-3.5,3.5), 
                      ylimits = c(0,8), 
                      annotgenes = T,
                      xtitle = 'log2(SY351/DMSO)',
                      ytitle = 'log10(p-value)')

 
mygglist$ggplot_volcano

mygglist$plotly_volcano
ggsave(mygglist$ggplot_volcano, device = "pdf", filename = 'sy351_SILAC_phosphosites_hl60_volcanoplot.pdf')

Scatterplot matrix showing the isotope-label swap design and isotope bias

The isotope-label flip is seen as anti-correlation with the non-swapped conditions, and the isotope bias effect is seen as a non-zero correlation in the null comparisions, and the superposition of these correlated sites in the flipped sample (rep3LF)

gg.fdr.range.cut <-  c(0,0.01,1)
adjpval.ranges <- cut(psitestats$adj.P.Val.nullfiltisotope, gg.fdr.range.cut, include.lowest = T)  


psitestats <- psitestats %>% 
  mutate(gg.fdr.range  = factor(adjpval.ranges,
                                    levels = levels(adjpval.ranges),ordered = T,
                                    labels =  c( paste(gg.fdr.range.cut[1],'< q <=',gg.fdr.range.cut[2],sep = ''),
                                                 paste(gg.fdr.range.cut[2],'< q <=',gg.fdr.range.cut[3],sep = ''))))
  


psitestats <- psitestats %>% add_column(alph = 0)

psitestats$alph[psitestats$gg.fdr.range == '0.01< q <=1'] <- 0.2
psitestats$alph[psitestats$gg.fdr.range == '0< q <=0.01'] <- 1

my_dens <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(..., alpha = 0.7, color = NA) 
}

psitestats$gg.fdr.range <- factor(psitestats$gg.fdr.range ,
       levels = rev(levels(psitestats$gg.fdr.range )),ordered = T)

pm <- GGally::ggpairs(psitestats, 
                      mapping = aes(color = gg.fdr.range, fill = gg.fdr.range, alpha = alph), 
              columns = c("null", "rep1", "rep2", "rep3LF"),
              title = 'SY351-treated HL60 Phosphorylation Site Log2 Ratios',
              xlab = 'log2(SY351/DMSO)', ylab = 'log2(SY351/DMSO)',
              upper = list(continuous = function(data, mapping, ...) {
                ggally_cor(data = data, mapping = mapping, size = 2) + 
                  scale_colour_manual(values = c("black","red"))
              }),
              diag = list(continuous = function(data, mapping, ...) {
                ggally_densityDiag(data = data, mapping = mapping, alpha = 0.5) +
                  scale_fill_manual(values = c("black","red")) +
                  scale_color_manual(values = c("black","black")) +
                  coord_cartesian(xlim = c(-3,3))
              }),
              lower = list(continuous = function(data, mapping, ...) {
                ggally_smooth(data = data, mapping = mapping) + 
                  scale_colour_manual(values = c("black", "red")) +
                  coord_cartesian(xlim = c(-3,3),
                                  ylim = c(-3,3)) 
                }))


pm <- pm + theme_bw()
pm

ggsave(pm <- pm + theme_bw(), device = "pdf", filename = 'sy351_SILAC_phosphosites_hl60_scatterplotmatrix.pdf')

Reformat the Sequence.window column to select first (if multiples concatenated with “;”) and lower case the phosphorylated residue

firstseqs <-  extract.first.id(psitestats$Sequence.window)

aas <- sort(unique(unlist(strsplit(firstseqs, ""))))
aas
##  [1] "_" "A" "C" "D" "E" "F" "G" "H" "I" "K" "L" "M" "N" "P" "Q" "R" "S" "T" "U" "V" "W" "Y"
# if there are U's or _, uncomment below lines if sending to momo
# 
if (any(aas == "U")) {
  firstseqs <- gsub("U", "X", firstseqs)
}
if (any(aas == "_")) {
  firstseqs <- gsub("_", "X", firstseqs)
}

#

phrespos <- (unique(nchar(firstseqs)) + 1)/2


firstseqs <- substr(firstseqs, start = phrespos - ((phseqlen - 1)/2), 
                    stop = phrespos + ((phseqlen - 1)/2))

newphrespos <- (unique(nchar(firstseqs)) + 1)/2

pseqs.psitelower <- paste(substr(firstseqs, start = 1, stop = newphrespos - 1),
                          tolower(substr(firstseqs, start = newphrespos, stop = newphrespos)),
                          substr(firstseqs, start = newphrespos + 1, stop = phseqlen),sep = "")

psitestats$Seqwin <-  pseqs.psitelower

psitestats %>% dplyr::filter(Gene.name == "CDK12") %>% 
  select(Gene.names, psite, logFC.nullfiltisotope, adj.P.Val.nullfiltisotope, Seqwin,
         PhosphoSitePlus.kinase, Pfam.domains, 
         Regulatory.site.function, domain, Localization.prob, Sequence.window)
# psitestats %>% dplyr::filter(grepl("kinase", domain, ignore.case = T)) %>% pull(domain) %>% unique() %>% sort() 
# 
# psitestats %>% dplyr::filter(grepl("kinase", domain, ignore.case = T)) %>% pull(Protein.names) %>% unique() %>% sort() 
# 
# psitestats %>% dplyr::filter(grepl("kinase", Protein.names, ignore.case = T)) %>% pull(domain) %>% unique() %>% sort() 
# 
# psitestats %>% dplyr::filter(Protein.names == "TGF-beta receptor type-2") %>% print(width = Inf)

Identify and remove duplicates on the Protein, gene.name, psite and sequence window level

# Take the most significant ratio for cases where groups of multi-phosphorylated peptides
# for the same site result in multiple quantified site ratios.

# First check to find the sites with multiples. 

psitestats <- psitestats %>%  add_count(uniprot.acc.leading, Gene.name, psite, Seqwin,  name = "ndupsites") %>% 
  arrange(desc(ndupsites), uniprot.acc.leading, Gene.name, psite)

psitestats %>% dplyr::filter(ndupsites > 1) %>%
  print(n = 10)
## # A tibble: 4,400 x 89
##    ID    logFC.nullfilti~ P.Value.nullfil~ adj.P.Val.nullf~    null   rep1    rep2 rep3LF Gene.name Gene.names proteinclass psite gene.psite Sequence.window Position
##    <chr>            <dbl>            <dbl>            <dbl>   <dbl>  <dbl>   <dbl>  <dbl> <chr>     <chr>      <chr>        <chr> <chr>      <chr>              <int>
##  1 UID1~           -0.295       0.00764             0.0648   0.321  -0.198 -0.194   0.349 MED14     MED14      ""           S1136 MED14.S11~ AARMPGMSPANPSL~     1136
##  2 UID3~            0.474       0.000270            0.0111  -0.0531  0.579  0.469  -0.404 MED14     MED14      ""           S1136 MED14.S11~ AARMPGMSPANPSL~     1136
##  3 UID6~            0.997       0.00000412          0.00208 -0.180   0.952  0.801  -1.10  MED14     MED14      ""           S1136 MED14.S11~ AARMPGMSPANPSL~     1136
##  4 UID1~            0.641       0.0000361           0.00439 -0.0926  0.652  0.511  -0.693 MED14     MED14      ""           S1142 MED14.S11~ MSPANPSLHSPVPD~     1142
##  5 UID3~            0.877       0.00000591          0.00235 -0.0859  0.820  0.686  -1.01  MED14     MED14      ""           S1142 MED14.S11~ MSPANPSLHSPVPD~     1142
##  6 UID6~           NA           1                   1       NA       0.958  0.805  -1.11  MED14     MED14      ""           S1142 MED14.S11~ MSPANPSLHSPVPD~     1142
##  7 UID1~            0.490       0.00878             0.0695  -0.147   0.721  0.0522 -0.584 DKC1      DKC1       ""           S451  DKC1.S451  QVVAEAAKTAKRKR~      451
##  8 UID6~            0.640       0.0119              0.0819  -0.422   1.05   0.283  -0.524 DKC1      DKC1       ""           S451  DKC1.S451  QVVAEAAKTAKRKR~      451
##  9 UID3~           NA           1                   1       -0.506   0.882  0.159  -0.616 DKC1      DKC1       ""           S451  DKC1.S451  QVVAEAAKTAKRKR~      451
## 10 UID1~            0.474       0.00167             0.0279   0.0333  0.717  0.300  -0.439 DKC1      DKC1       ""           S453  DKC1.S453  VAEAAKTAKRKRES~      453
## # ... with 4,390 more rows, and 74 more variables: Amino.acid <chr>, PhosphoSitePlus.kinase <chr>, uniprot.acc.leading <chr>, Protein.name <chr>,
## #   logFC.isotopeffect <dbl>, P.Value.isotopeffect <dbl>, adj.P.Val.isotopeffect <dbl>, logFC.nullfilt <dbl>, P.Value.nullfilt <dbl>, adj.P.Val.nullfilt <dbl>,
## #   logFC.simp <dbl>, P.Value.simp <dbl>, adj.P.Val.simp <dbl>, logFC.notrend <dbl>, P.Value.notrend <dbl>, adj.P.Val.notrend <dbl>, kinabbrev <chr>,
## #   gene.psite.kinase <chr>, Multiplicity <chr>, Known.site <chr>, Origin <chr>, Regulatory.site <chr>, Regulatory.site.function <chr>,
## #   Regulatory.site.process <chr>, Regulatory.site.protInteract <chr>, Regulatory.site.otherInteract <chr>, Regulatory.site.notes <chr>, Motifs <chr>,
## #   Pfam.domains <chr>, active.site <chr>, binding.site <chr>, calcium.binding.region <chr>, chain <chr>, coiled.coil.region <chr>,
## #   compositionally.biased.region <chr>, disulfide.bond <chr>, dna.binding.region <chr>, domain <chr>, glycosylation.site <chr>, helix <chr>,
## #   metal.ion.binding.site <chr>, modified.residue <chr>, mutagenesis.site <chr>, nucleotide.phosphate.binding.region <chr>, peptide <chr>, propeptide <chr>,
## #   region.of.interest <chr>, repeat. <chr>, sequence.conflict <chr>, sequence.variant <chr>, short.sequence.motif <chr>, signal.peptide <chr>, site <chr>,
## #   splice.variant <chr>, strand <chr>, topological.domain <chr>, transit.peptide <chr>, transmembrane.region <chr>, turn <chr>, zinc.finger.region <chr>,
## #   Localization.prob <dbl>, PEP <dbl>, Score <dbl>, Delta.score <dbl>, Score.for.localization <dbl>, Mass.error..ppm. <dbl>, Intensity <dbl>, Intensity.L <dbl>,
## #   Intensity.H <dbl>, logp <dbl>, gg.fdr.range <ord>, alph <dbl>, Seqwin <chr>, ndupsites <int>
# first check and make sure that there are no NA's in the p-value column that
# will be used to select the row in each duplicate group (Protein, Gene.name, psite, Seqwin)

sum(is.na(psitestats$P.Value.nullfiltisotope))
## [1] 0
# > sum(is.na(psitestats$P.Value.nullfiltdye))
# [1] 0

psitestats %>% group_by(uniprot.acc.leading, Gene.name, psite, Seqwin) %>% mutate(ndupsiterank = dense_rank(adj.P.Val.nullfiltisotope)) %>%
  pull(ndupsiterank) %>% table()
## .
##     1     2     3 
## 15642  2032   175

select most significant row from multiplicitous sites

(due to same peptide observed with more than one phosphoforms) quantified and tested

psitestats <- psitestats %>% group_by(uniprot.acc.leading, Gene.name, psite, Seqwin) %>% 
  mutate(ndupsiterank = row_number(adj.P.Val.nullfiltisotope)) %>% 
  dplyr::filter(ndupsiterank == 1) %>% ungroup()
# 15,555 rows
###

# Should be no values in n column > 1
psitestats %>% count(uniprot.acc.leading, Gene.name, psite) %>% ungroup() %>% arrange(desc(n)) %>% print(n = 10)
## # A tibble: 15,555 x 4
##    uniprot.acc.leading Gene.name psite     n
##    <chr>               <chr>     <chr> <int>
##  1 A0AVK6              E2F8      S225      1
##  2 A0AVK6              E2F8      S71       1
##  3 A0AVT1              UBA6      S36       1
##  4 A0AVT1              UBA6      S737      1
##  5 A0FGR8-2            ESYT2     S663      1
##  6 A0FGR8-2            ESYT2     S665      1
##  7 A0FGR8-2            ESYT2     S671      1
##  8 A0FGR8-2            ESYT2     S710      1
##  9 A0FGR8-2            ESYT2     S711      1
## 10 A0FGR8-2            ESYT2     S727      1
## # ... with 1.554e+04 more rows
psitestats %>% count(Gene.name, psite) %>% arrange(desc(n)) %>% print(n = 50)
## # A tibble: 15,516 x 3
##    Gene.name psite     n
##    <chr>     <chr> <int>
##  1 ""        S28       2
##  2 "ARMC10"  S45       2
##  3 "BAP18"   S35       2
##  4 "BAP18"   S36       2
##  5 "CD33"    S307      2
##  6 "CDK12"   T1244     2
##  7 "CDK12"   T1246     2
##  8 "DBNL"    S232      2
##  9 "GIGYF2"  S236      2
## 10 "GMIP"    S425      2
## 11 "GMIP"    S437      2
## 12 "GMIP"    T414      2
## 13 "HDGFRP2" S633      2
## 14 "HN1"     S87       2
## 15 "HN1"     S88       2
## 16 "HN1"     S91       2
## 17 "HN1"     S92       2
## 18 "ILF3"    S476      2
## 19 "ILF3"    S477      2
## 20 "ILF3"    S482      2
## 21 "ILF3"    S503      2
## 22 "ILF3"    T486      2
## 23 "KAT5"    S119      2
## 24 "KAT5"    S123      2
## 25 "MAX"     S2        2
## 26 "MFF"     S146      2
## 27 "MKI67"   S127      2
## 28 "MKI67"   S128      2
## 29 "NIPBL"   S2672     2
## 30 "PML"     S403      2
## 31 "PML"     S518      2
## 32 "PML"     S527      2
## 33 "PPM1B"   S376      2
## 34 "RTN4"    S181      2
## 35 "RTN4"    S182      2
## 36 "SERBP1"  S219      2
## 37 "SERBP1"  S221      2
## 38 "TMPO"    S180      2
## 39 "TMPO"    S184      2
## 40 ""        S113      1
## 41 ""        S13       1
## 42 ""        S153      1
## 43 ""        S155      1
## 44 ""        S157      1
## 45 ""        S340      1
## 46 ""        S71       1
## 47 ""        S79       1
## 48 ""        S82       1
## 49 ""        S9        1
## 50 ""        S93       1
## # ... with 1.547e+04 more rows
psitestats %>% count(Gene.name, psite, Multiplicity ) %>% arrange(desc(n)) %>% print(n = 10)
## # A tibble: 15,523 x 4
##    Gene.name psite Multiplicity     n
##    <chr>     <chr> <chr>        <int>
##  1 ""        S28   ___1             2
##  2 "ARMC10"  S45   ___1             2
##  3 "BAP18"   S35   ___1             2
##  4 "BAP18"   S36   ___1             2
##  5 "CD33"    S307  ___1             2
##  6 "CDK12"   T1244 ___1             2
##  7 "CDK12"   T1246 ___1             2
##  8 "DBNL"    S232  ___1             2
##  9 "GIGYF2"  S236  ___1             2
## 10 "GMIP"    S425  ___1             2
## # ... with 1.551e+04 more rows

Write out data to files in a subdirectory named for current date

Change chunk option eval = T if you’d like to write out these files

sy351tabfile <-  as.character(filenamer::filename('SY351_HL60_SILAC_eBayes_allmodelcompare_3valid', ext = 'txt', subdir = T))
filenamer::make_path(sy351tabfile)

write.table(psitestats,
            file = sy351tabfile,
            quote = F,
            sep = '\t',
            na = "",
            row.names = F,
            col.names = T)

write.table(names(psitestats), 
            file = as.character(filenamer::filename('sy351_colnames_metadata_3valid', ext = 'txt', subdir = T)),
            col.names = F, 
            row.names = F, 
            quote = F)


## For Chia-Yu's kinase substrate network cytoscape analysis
dim(psitestats[!is.na(psitestats$logFC.nullfiltisotope) & !is.na(psitestats$Gene.names) & psitestats$Gene.names != "",])


write.table(psitestats[!is.na(psitestats$logFC.nullfiltisotope) & !is.na(psitestats$Gene.names) & psitestats$Gene.names != "",],
            file = as.character(filenamer::filename('SY351_HL60_SILAC_eBayes_allmodelcompare_3valid_ForCYenKSR', ext = 'txt', subdir = T)),
            quote = F,
            sep = '\t',
            na = "",
            row.names = F,
            col.names = T)


syrgenesall05 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.05 & !is.na(psitestats$adj.P.Val.nullfiltisotope)])
syrgenesall05 <- gsub(";.*","", syrgenesall05)

write.table(syrgenesall05, file = as.character(filenamer::filename('syrgenes_upanddown_qval05', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)


syrgenesall05.fc0p5 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.05 & !is.na(psitestats$adj.P.Val.nullfiltisotope) &
                                                  !is.na(psitestats$logFC.nullfiltisotope) & abs(psitestats$logFC.nullfiltisotope) >= log2(1.5) ])
syrgenesall05.fc0p5 <- gsub(";.*","", syrgenesall05.fc0p5)

write.table(syrgenesall05.fc0p5, 
            file = as.character(filenamer::filename('syrgenes_upanddown_qval05_foldchange50pct', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)


syrgenesall01 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.01 & !is.na(psitestats$adj.P.Val.nullfiltisotope)])
syrgenesall01 <- gsub(";.*","", syrgenesall01)

write.table(syrgenesall01, file = as.character(filenamer::filename('syrgenes_upanddown_qval01', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)


syrgenesdown05 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.05 & !is.na(psitestats$adj.P.Val.nullfiltisotope) &
                                           !is.na(psitestats$logFC.nullfiltisotope) & psitestats$logFC.nullfiltisotope < 0])
syrgenesdown05 <- gsub(";.*","", syrgenesdown05)
write.table(syrgenesdown05, file = as.character(filenamer::filename('syrgenes_down_qval05', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)


syrgenesup05 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.05 & !is.na(psitestats$adj.P.Val.nullfiltisotope) &
                                             !is.na(psitestats$logFC.nullfiltisotope) & psitestats$logFC.nullfiltisotope > 0])
syrgenesup05 <- gsub(";.*","", syrgenesup05)
write.table(syrgenesup05, file = as.character(filenamer::filename('syrgenes_up_qval05', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)

 
syrgenesdown01 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.01 & !is.na(psitestats$adj.P.Val.nullfiltisotope) &
                                             !is.na(psitestats$logFC.nullfiltisotope) & psitestats$logFC.nullfiltisotope < 0])
syrgenesdown01 <- gsub(";.*","", syrgenesdown01)
write.table(syrgenesdown01, file = as.character(filenamer::filename('syrgenes_down_qval01', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)


syrgenesup01 <- unique(psitestats$Gene.names[psitestats$adj.P.Val.nullfiltisotope < 0.01 & !is.na(psitestats$adj.P.Val.nullfiltisotope) &
                                           !is.na(psitestats$logFC.nullfiltisotope) & psitestats$logFC.nullfiltisotope > 0])
syrgenesup01 <- gsub(";.*","", syrgenesup01)
write.table(syrgenesup01, file = as.character(filenamer::filename('syrgenes_up_qval01', ext = 'txt', subdir = T)), 
            quote = F, row.names = F, col.names = F)